The objective of this analysis of the chips segment is to help strategize plans for the next half year. The analysis comprises identifying customer purchasing behaviour in order to generate insights and provide commercial recommendations.
The workflow followed can be summarized as: loading and inspecting the data, cleaning transactions and engineering features, merging the two datasets, segmenting customers, hypothesis testing, and affinity analysis. (The Plotly library is used for graphs and charts.)
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.templates.default = "plotly_white"
import statsmodels.api as sm
from scipy import stats
# to identify the encoding the CSV is opened with (the .xlsx file is binary
# and is handled by pandas/openpyxl directly)
# with open("QVI_purchase_behaviour.csv", mode = "r") as result:
#     print(result.encoding)
# read the purchase behaviour dataset
purchase_behaviour = pd.read_csv("QVI_purchase_behaviour.csv")
# read the transactions dataset
transaction_data = pd.read_excel("QVI_transaction_data.xlsx")
purchase_behaviour.info()
pd.options.display.float_format = "{:.3f}".format
print("\n")
purchase_behaviour.describe(include = "all")
# display a few rows from purchase behaviour
# (DataFrame.append was removed in pandas 2.0, so concat is used instead)
pd.concat([purchase_behaviour.head(), purchase_behaviour.tail()])
transaction_data.info()
print("\n")
transaction_data.describe()
# display a few rows from transactions
pd.concat([transaction_data.head(), transaction_data.tail()])
from datetime import datetime as dt
# convert Excel serial day numbers to dates: Excel's day 1 is 1900-01-01, and
# the "- 2" offset accounts for Excel's phantom leap day 1900-02-29
transaction_data["DATE"] = (transaction_data["DATE"].apply(
lambda date: dt.fromordinal(dt(1900, 1, 1).toordinal() + date - 2)))
transaction_data.head()
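Equivalently, pandas can convert Excel serial day numbers directly; day 0 of Excel's 1900 date system is 1899-12-30, which absorbs the same offset as the ordinal arithmetic above. A minimal sketch with hypothetical serial values:

```python
import pandas as pd

# Excel serial day numbers count from 1899-12-30 in the 1900 date system,
# so pd.to_datetime can convert them without manual ordinal arithmetic.
serials = pd.Series([43282, 43646])  # hypothetical sample serials
dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(dates.dt.strftime("%Y-%m-%d").tolist())  # → ['2018-07-01', '2019-06-30']
```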
# for ease of code, applying lower case to all column names in both the data sets
purchase_behaviour.columns = [each.lower() for each in purchase_behaviour.columns]
transaction_data.columns = [each.lower() for each in transaction_data.columns]
cust_dist_df = (purchase_behaviour.groupby(["lifestage", "premium_customer"])
                .size().reset_index(name = "count"))
fig = px.bar(data_frame = cust_dist_df,
y = "lifestage",
x = "count",
barmode = "group",
orientation = "h",
color = "premium_customer",
labels = {"lifestage" : "Customer's Lifestage",
"count" : "Total Customers",
"premium_customer" : "Customer Group"},
title = {"text": "Overview of Total Customers by Group and Lifestage", "font_size" : 20, "x" : 0.5},
color_discrete_sequence = px.colors.diverging.Portland,
)
fig.show()
It is evident that the Mainstream customer group is the most prominent (this overview counts customers across all products). New Families form the smallest group.
As tot_sales and prod_qty are the quantitative columns, outliers have been identified for these two columns only.
fig = make_subplots(rows = 2, cols = 1, shared_xaxes = False)
fig.add_trace(go.Box(x = transaction_data["tot_sales"], name = "Total Sales"), row = 1, col = 1)
fig.add_trace(go.Box(x = transaction_data["prod_qty"], name = "Product Quantity"), row = 2, col = 1)
fig.update_layout(width = 900, height = 250, showlegend = False,
title = "Outliers in Total Sales and Product Quantity",
font = {"size": 14})
fig.show()
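The whiskers in these box plots follow Plotly's default 1.5×IQR convention. The rule can be sketched on hypothetical values:

```python
import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the box-plot whisker rule."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

vals = pd.Series([2, 2, 3, 3, 4, 200])  # hypothetical quantities with one extreme value
print(iqr_outliers(vals).tolist())  # → [False, False, False, False, False, True]
```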
For both product quantity and sales there is only one extreme outlier each. Let's filter all the transactions that contain these two values.
# analysis of outliers
transaction_data[(transaction_data["tot_sales"] > 600) | (transaction_data["prod_qty"] > 150)]
# customer details associated with the outliers
loyalty_number = (transaction_data.loc[(transaction_data["tot_sales"] > 600) | (transaction_data["prod_qty"] > 150),
"lylty_card_nbr"])
purchase_behaviour[purchase_behaviour["lylty_card_nbr"].isin(loyalty_number)]
One interesting thing to note here: there are only 9 transactions in almost a year at store 226! Is this the case for other stores as well?
# remove outlier from transaction data
transaction_data = transaction_data[~transaction_data["lylty_card_nbr"].isin(loyalty_number)]
fig = make_subplots(rows = 1, cols = 2, shared_yaxes = True,
subplot_titles = ["Distribution", "Correlation"])
fig.add_box(y = transaction_data["prod_qty"], name = "Product Qty", row = 1, col = 1, )
fig.add_box(y = transaction_data["tot_sales"], name = "Total Sales", row = 1, col = 1,)
fig.add_scattergl(x = transaction_data["prod_qty"],
y = transaction_data["tot_sales"],
mode = "markers", name = "",
hoverinfo = "skip",
row = 1, col = 2,
)
fig.update_layout(title = {"text":"Distribution and Correlation of Total Sales and Product Quantity",
"font_size" : 20, "x" : 0.5},
showlegend = False,
)
fig.show()
The relationship between product quantity purchased and sales is positive but heteroscedastic: the spread of sales widens as quantity increases.
tot_sales_df = (pd.pivot_table(data = transaction_data, index = ["date"],
values = "tot_sales", aggfunc = "sum").reset_index().round(2))
fig = px.scatter(data_frame = tot_sales_df,
x = "date", y = "tot_sales",
color = "tot_sales", size = "tot_sales", size_max = 12,
title = {"text": "Total Sales Trend for the Year for All Products", "font_size" : 20, "x" : 0.5},
labels = {"date": "Date", "tot_sales": "Total Sales"},
color_continuous_scale = px.colors.diverging.Portland,
)
fig.update_layout(xaxis_title = "Month-Year")
fig.show()
The highest sales were recorded in Dec 2018, while the lowest were recorded in Aug 2018 and May 2019.
print("Extracted brand names:")
for each_brand in transaction_data["prod_name"].str.split().str[0].value_counts(dropna = False).sort_index().index:
print(each_brand)
Brand names like Smith and Smiths, GrnWves and Grain Waves, etc. represent the same brands, but with inconsistent spellings they appear to be different. Time for some clean-up!
brand_names = {"RRD": "Red Rock Deli", "NCC" : "Natural Chips Co", r"\bDorito\b" : "Doritos", "GrnWves" : "Grain Waves",
r"\bSmith\b" : "Smiths", "Snbts" : "Sunbites", "WW" : "Woolworths", "Infzns" : "Infuzions"}
transaction_data["prod_name"] = transaction_data["prod_name"].replace(brand_names, regex = True)
transaction_data["brands"] = (transaction_data["prod_name"].str.replace(r"[^\w\s]+", "", regex = True).
str.replace(r"\d+.", " ", regex = True).str.replace(r"\s+", " ", regex = True).
str.split().str[0].str.strip())
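The word boundaries (`\b`) in the mapping above matter: they let "Smith" be normalized without clobbering "Smiths". A small demo on made-up product names:

```python
import pandas as pd

# Word boundaries (\b) ensure "Smith" matches as a whole word, so "Smiths"
# is left untouched while "Smith" and "Dorito" are normalized.
names = pd.Series(["Smith Crinkle Cut", "Smiths Chip Thinly", "Dorito Corn Chp"])
fixed = names.replace({r"\bSmith\b": "Smiths", r"\bDorito\b": "Doritos"}, regex=True)
print(fixed.tolist())  # → ['Smiths Crinkle Cut', 'Smiths Chip Thinly', 'Doritos Corn Chp']
```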
Here I have deviated from what was suggested in the sample project. I found two products, Red Rock Deli SR Salsa Mzzrlla and Smiths Crinkle Cut Tomato Salsa, which are potato chips (or similar) but have a salsa flavour. If we simply remove every product whose name contains "Salsa", we lose these products as well. Although they are just two of the 107 (= 114 − 7 genuine salsa products) chips products, I didn't drop them from the transactions data.
So instead of a general drop, I specified the genuine salsa products as a list and then kept only the true non-salsa products.
transaction_data.loc[transaction_data["prod_name"].str.contains("Salsa"), "prod_name"].unique()
non_chip_product = ['Old El Paso Salsa Dip Tomato Mild',
'Woolworths Mild Salsa',
'Doritos Salsa Medium',
'Old El Paso Salsa Dip Chnky Tom Ht',
'Old El Paso Salsa Dip Tomato Med',
'Woolworths Medium Salsa',
'Doritos Salsa Mild']
transaction_data = transaction_data[~transaction_data["prod_name"].isin(non_chip_product)]
transaction_data["pack_size_gms"] = (transaction_data["prod_name"]
.str.extract(pat = r"(\d+)", expand = False).astype(int))
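The extraction pulls the first run of digits out of each product name, which for this dataset is the pack weight. A sketch on hypothetical names:

```python
import pandas as pd

# str.extract with one capture group and expand=False returns a Series;
# the first run of digits in each name is the pack weight in grams.
names = pd.Series(["Kettle Sweet Chilli 175g", "Twisties Cheese 270g"])  # hypothetical
sizes = names.str.extract(r"(\d+)", expand=False).astype(int)
print(sizes.tolist())  # → [175, 270]
```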
import matplotlib.pyplot as plt
%matplotlib inline
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (12, 4))
transaction_data["pack_size_gms"].hist( ax = ax1)
transaction_data["pack_size_gms"].plot.kde( ax = ax2)
ax1.set_title("Distribution of Pack Sizes", size = 16)
ax1.set_xlabel("Pack Size", size = 12)
ax1.set_ylabel("Total Transactions", size = 12)
ax2.set_title("Kernel Density Estimate of Pack Sizes", size = 16)
ax2.set_xlabel("Pack Size", size = 12)
ax2.set_ylabel("Density", size = 12)
plt.tight_layout()
plt.show()
chips_retail = pd.merge(right = purchase_behaviour,
left = transaction_data,
on = "lylty_card_nbr",
how = "left")
# check for any null values
chips_retail.isnull().sum()
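Instead of a separate null check, the merge itself can report unmatched rows via `indicator=True`. A minimal sketch with hypothetical card numbers:

```python
import pandas as pd

# indicator=True adds a "_merge" column flagging transactions that have
# no matching loyalty record ("left_only") after a left merge.
tx = pd.DataFrame({"lylty_card_nbr": [1000, 1002], "tot_sales": [6.0, 2.9]})
pb = pd.DataFrame({"lylty_card_nbr": [1000, 1001], "lifestage": ["RETIREES", "NEW FAMILIES"]})
merged = tx.merge(pb, on="lylty_card_nbr", how="left", indicator=True)
print(merged["_merge"].tolist())  # → ['both', 'left_only']
```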
# generate ".csv" file for task 2 of the project
# chips_retail.to_csv("chips_retail.csv", index = False)
segment_df = pd.pivot_table(data = chips_retail,
values = ["tot_sales", "prod_qty"],
index = ["lifestage", "premium_customer"],
aggfunc = {"tot_sales": ["sum", "mean"],
"prod_qty" : ["sum", "mean"]}
)
segment_df.columns = segment_df.columns.map("_".join)
segment_df.reset_index(inplace = True)
segment_df = segment_df.round(5)
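The column flattening above joins the two levels of the pivot's MultiIndex into single names like `tot_sales_sum`. On a toy frame (hypothetical values):

```python
import pandas as pd

# Aggregating with a list of functions yields MultiIndex columns like
# ('sum', 'sales'); joining the levels flattens them to 'sum_sales'.
toy = pd.DataFrame({"grp": ["A", "A", "B"], "sales": [1.0, 3.0, 5.0]})
pt = pd.pivot_table(toy, index="grp", values="sales", aggfunc=["sum", "mean"])
pt.columns = pt.columns.map("_".join)
print(pt.reset_index().to_dict("list"))  # → {'grp': ['A', 'B'], 'sum_sales': [4.0, 5.0], 'mean_sales': [2.0, 5.0]}
```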
fig = make_subplots(rows=1, cols=3, specs=[[{"type": "domain"}, {"type": "domain"},{"type": "domain"}]])
segments = ["Budget", "Premium", "Mainstream"]
for each_group in segments:
# add pie trace to each plot
fig.add_trace(go.Pie(labels = segment_df.loc[segment_df["premium_customer"] == each_group, "lifestage"],
values = segment_df.loc[segment_df["premium_customer"] == each_group, "tot_sales_sum"],
name = each_group,
),
row = 1, col = segments.index(each_group) + 1)
fig.update_traces(hole = 0.6,
hoverinfo = "label+value+name",
sort = False,
marker_colors = px.colors.diverging.Portland,
scalegroup = "one")
fig.update_layout(legend = {"orientation": "h", "yanchor": "bottom", "y": -0.25, "xanchor": "center", "x": 0.5},
title = {"text": "Total Sales by Customer Group and Customer's Lifestage",
"font_size": 20, "x" : 0.5},
annotations = [{"text" : "Budget", "x" : 0.1, "y" : 0.5, "font_size": 15, "showarrow": False},
{"text" : "Premium", "x" : 0.5, "y" : 0.5, "font_size": 15, "showarrow": False},
{"text" : "Mainstream", "x" : 0.91, "y" : 0.5, "font_size": 15, "showarrow": False}],
height = 440)
fig.show()
Each pie chart shows the share of each lifestage in Total Sales for the respective customer group.
px.scatter(data_frame = segment_df,
y = "lifestage",
x = "tot_sales_mean",
symbol = "premium_customer", color = "premium_customer",
size = "tot_sales_mean", size_max = 12,
labels = {"lifestage": "Customer's Life Stage",
"premium_customer" : "Customer Class",
"tot_sales_mean" : "Average Sales per unit for the Segment"},
height = 350,
title = {"text": "Average Sales per unit for Customer Class and Life Stage Segment",
"font_size" : 20},
color_discrete_sequence = px.colors.diverging.Portland
)
px.scatter(data_frame = segment_df,
y = "lifestage",
x = "prod_qty_mean",
symbol = "premium_customer", color = "premium_customer",
size = "prod_qty_mean", size_max = 12,
labels = {"lifestage": "Customer's Life Stage",
"premium_customer" : "Customer Group",
"prod_qty_mean" : "Average Quantity for the Segment"},
height = 350,
title = {"text": "Average Quantity Purchased by Customer Group and Life Stage Segment",
"font_size": 20},
color_discrete_sequence = px.colors.diverging.Portland
)
Similar to the average sales, the average quantity purchased does not vary much across the segments.
This t-test will help us understand whether the average spend of Mainstream Midage & Young Singles/Couples customers is significantly different from that of Premium and Budget Midage & Young Singles/Couples customers.
Hypothesis for the test:
$H_0 : \bar{x}_M = \bar{x}_{(B \space or \space P)}$
$H_{\alpha} : \bar{x}_M \neq \bar{x}_{(B \space or \space P)}$
At a significance level of $\alpha = 5\%$.
I was not sure whether to take Budget and Premium together or separately for this test, so I ran it both ways: individually and combined.
mainstream = chips_retail.loc[(chips_retail["premium_customer"] == "Mainstream") &
(chips_retail["lifestage"].isin(["MIDAGE SINGLES/COUPLES",
"YOUNG SINGLES/COUPLES"])), "tot_sales"]
budget = chips_retail.loc[(chips_retail["premium_customer"] == "Budget") &
(chips_retail["lifestage"].isin(["MIDAGE SINGLES/COUPLES",
"YOUNG SINGLES/COUPLES"])), "tot_sales"]
premium = chips_retail.loc[(chips_retail["premium_customer"] == "Premium") &
(chips_retail["lifestage"].isin(["MIDAGE SINGLES/COUPLES",
"YOUNG SINGLES/COUPLES"])), "tot_sales"]
var = [np.var(each) for each in [mainstream, budget, premium]]
t_stats, p_value = stats.ttest_ind(a = mainstream, b = budget, equal_var = False)
print("For Mainstream vs Budget test statistics and the p-value are {:.3f} and {:.5f}".format(t_stats, p_value))
t_stats, p_value = stats.ttest_ind(a = mainstream, b = premium, equal_var = False)
print("For Mainstream vs Premium test statistics and the p-value are {:.3f} and {:.5f}".format(t_stats, p_value))
# budget and premium added together
bgt_prm = chips_retail.loc[(chips_retail["premium_customer"] != "Mainstream") &
(chips_retail["lifestage"].isin(["MIDAGE SINGLES/COUPLES",
"YOUNG SINGLES/COUPLES"])), "tot_sales"]
np.var(bgt_prm)
t_stats, p_value = stats.ttest_ind(a = mainstream, b = bgt_prm, equal_var = False)
print("For Mainstream vs Others test statistics and the p-value are {:.3f} and {:.5f}".format(t_stats, p_value))
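As a sanity check on the Welch (unequal-variance) form used above (`equal_var = False`), the statistic can be recomputed by hand on toy data (hypothetical spends, not the project's figures):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(7.5, 3.0, 500)   # toy "Mainstream" spends (hypothetical)
b = rng.normal(7.0, 2.5, 500)   # toy "Budget/Premium" spends (hypothetical)

t_scipy, _ = stats.ttest_ind(a, b, equal_var=False)

# Welch's t: (mean_a - mean_b) / sqrt(s_a^2/n_a + s_b^2/n_b), sample variances (ddof=1)
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_manual = (a.mean() - b.mean()) / se
print(np.isclose(t_scipy, t_manual))  # → True
```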
Based on the t-test results, we can conclude that, at 5% significance, the average sales for the Mainstream midage & young singles/couples segment are statistically different from those of the Budget and Premium customers in the same lifestages, whether tested individually or combined.
(Affinity analysis is a fairly new subject for me.)
The chips brand most preferred by the Mainstream - Young Singles/Couples segment, relative to all other customers, is Tyrrells; Twisties and Kettle are the next top contenders.
Both products under the Tyrrells brand are packaged in 165 g packs:
main_mask = ((chips_retail["premium_customer"] == "Mainstream") &
(chips_retail["lifestage"] == "YOUNG SINGLES/COUPLES"))
main_segment = chips_retail.loc[main_mask]
# "others" is everyone outside the target segment, i.e. the complement (~) of the
# mask, not customers who are neither Mainstream nor young singles/couples
others = chips_retail.loc[~main_mask]
main_segment_qty_sum = main_segment["prod_qty"].sum()
others_qty_sum = others["prod_qty"].sum()
main_segment_ratio = pd.Series(main_segment.groupby("brands")["prod_qty"].sum()/main_segment_qty_sum)
others_ratio = pd.Series(others.groupby("brands")["prod_qty"].sum()/others_qty_sum)
brand_affinity = pd.concat([main_segment_ratio, others_ratio,
pd.Series(main_segment_ratio / others_ratio)], axis = 1).reset_index()
brand_affinity.columns = ["brands", "mainstream_young", "others", "brand_affinity"]
brand_affinity.sort_values("brand_affinity", ascending = False)
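The affinity ratio computed above can be illustrated with toy numbers (hypothetical quantities): a value above 1 means the segment buys the brand more heavily than everyone else does.

```python
import pandas as pd

# Affinity = segment's share of quantity for a brand, divided by all other
# customers' share of that brand. Quantities below are hypothetical.
seg = pd.Series({"Tyrrells": 30, "Kettle": 70})
oth = pd.Series({"Tyrrells": 200, "Kettle": 800})
affinity = (seg / seg.sum()) / (oth / oth.sum())
print(affinity)  # Tyrrells ≈ 1.5 (over-indexed), Kettle ≈ 0.875 (under-indexed)
```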
print(chips_retail.loc[chips_retail["brands"] == "Tyrrells", "prod_name"].unique())
fig = px.bar_polar(data_frame = brand_affinity.round(3),
theta = "brands",
r = "brand_affinity",
color = "brand_affinity",
color_continuous_scale = px.colors.diverging.Portland,
title = {"text": "Brand Affinity for Mainstream - Young Singles/Couples Segment",
"font_size" : 20, "x" : 0.5},
labels = {"brand_affinity": "Affinity"},
height = 600
)
fig.show()
The pack size most preferred by the Mainstream - Young Singles/Couples segment, relative to other customers, weighs 270 g, followed by the 380 g and 330 g packs.
The two products available in the 270 g pack both belong to Twisties, which again is the second most preferred brand for this segment:
main_segment_ratio = pd.Series(main_segment.groupby("pack_size_gms")["prod_qty"].sum()/main_segment_qty_sum)
others_ratio = pd.Series(others.groupby("pack_size_gms")["prod_qty"].sum()/others_qty_sum)
pack_affinity = pd.concat([main_segment_ratio, others_ratio,
pd.Series(main_segment_ratio / others_ratio)], axis = 1).reset_index()
pack_affinity.columns = ["pack_size_gms", "mainstream_young", "others", "pack_affinity"]
pack_affinity["pack_size_gms"] = pack_affinity["pack_size_gms"].astype(str).apply(lambda s: s + "g")
pack_affinity.sort_values("pack_affinity", ascending = False)
print([each for each in chips_retail.loc[chips_retail["pack_size_gms"] == 270, "prod_name"].unique()])
fig = px.bar_polar(data_frame = pack_affinity.round(3),
theta = "pack_size_gms",
r = "pack_affinity",
color = "pack_affinity",
color_continuous_scale = px.colors.diverging.Portland_r,
title = {"text": "Pack Size Affinity for Mainstream - Young Singles/Couples Segment",
"font_size" : 20, "x" : 0.5},
labels = {"pack_affinity": "Affinity"},
height = 600
)
fig.show()
df = chips_retail.groupby("date", as_index = False)["tot_sales"].sum()
# 2018-12-25 is absent from the data (stores closed on Christmas Day); add it with
# NaN sales so the line chart shows a visible gap. pd.concat replaces the removed
# DataFrame.append.
df = pd.concat([df, pd.DataFrame({"date" : [pd.Timestamp("2018-12-25")], "tot_sales" : [np.nan]})],
ignore_index = True)
df["date"] = pd.to_datetime(df["date"]).dt.strftime("%Y-%m-%d")
df = df.sort_values("date")
fig = go.Figure()
fig.add_trace(go.Scatter(x = df["date"],
y = df["tot_sales"],
mode = "lines", line = {"color" : px.colors.diverging.Portland[0]}))
fig.update_layout(title = {"text" : "All Year Total Sales for Chips", "font_size" : 20, "x" : 0.5},
height = 450,
xaxis = {"title" : "Date-Month-Year"},
yaxis = {"title" : "Total Sales for the Day", "range" : [3700, 7100]},
font_size = 13)
fig.show()
Let's further analyze the monthly sales, in particular Aug 2018, Dec 2018 and May 2019.
df = chips_retail.groupby(["date", "premium_customer"])["tot_sales"].sum().reset_index().round(2)
y_limits = [min(df["tot_sales"]) - 100, max(df["tot_sales"]) + 100]
# df["month"] = df["date"].dt.strftime("%B")
fig = px.scatter(data_frame = df,
x = df["date"].dt.strftime("%d"),
y = "tot_sales",
color = "premium_customer", symbol = "premium_customer",
size = "tot_sales", size_max = 12,
animation_frame = df["date"].dt.strftime("%B-%Y"),
height = 400,
range_y = y_limits,
color_discrete_sequence = px.colors.diverging.Portland,
labels = {"x": "Day", "tot_sales" : "Total Sales",
"premium_customer" : "Customer Group", "animation_frame" : "Month-Year"}
)
fig["layout"].pop("updatemenus")
fig.update_layout(title = {"text" : "Month-wise Sales", "font_size" : 20, "x": 0.5}, font_size = 14)
fig.show()
Out of the three customer groups (Premium, Budget and Mainstream), Mainstream contributes the most to potato chips sales. By lifestage, Budget - Older Families, Mainstream - Young Singles/Couples and Retirees contribute the most among all segments.
Mainstream - Midage & Young Singles/Couples buyers are 23% more inclined towards the chips brand Tyrrells than other customers. The average spend of Mainstream - Midage & Young Singles/Couples customers is also higher than that of Budget and Premium customers in the same lifestages.
We can conclude that the chips segment should focus the Tyrrells and Twisties brands on the Mainstream - Young Singles/Couples segment to help drive sales further.